[KYUUBI #7305] Fix JDBC engine returning tables from all databases when a database is specified in the URL #7419

Closed
wangzhigang1999 wants to merge 5 commits into apache:master from wangzhigang1999:KYUUBI-7305

Conversation

@wangzhigang1999
Contributor

Why are the changes needed?

This is a bug fix.

The Hive JDBC driver rewrites a null schemaPattern to "%" in GetTables, and the JDBC engine forwards it directly into WHERE TABLE_SCHEMA LIKE '%'. As a result, DBeaver — which always passes null once a connection is established — sees tables from every database in the backend rather than only the one in its connection URL. Compounding this, the JDBC engine never applies the URL-level database to the underlying JDBC connection, so there is no "current database" on the backend for a filter to fall back to. The fix introduces a dialect-level setCurrentDatabase hook, tracks an effectiveDatabase on the session, and routes blank/"%" schema patterns through it in GetTablesOperation. MySQLDialect additionally implements getCatalogsOperation / getSchemasOperation with JDBC-standard column labels (TABLE_CATALOG / TABLE_SCHEM) so the schemas tree in DBeaver populates correctly.

Non-MySQL dialects inherit the default no-op setCurrentDatabase and effectiveDatabase stays None, so PostgreSQL / Oracle / etc. keep their existing behavior.
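The extension point described above could look roughly like the following. This is an illustrative Java sketch, not the actual Kyuubi code (which is Scala); only the names setCurrentDatabase and MySQLDialect come from the PR description.

```java
import java.sql.Connection;

// Illustrative sketch of the dialect-level hook: the default is a no-op
// returning false, so non-MySQL dialects keep their existing behavior.
interface JdbcDialect {
    default boolean setCurrentDatabase(Connection conn, String database) {
        return false; // no-op: engine records no effective database
    }
}

class MySQLDialect implements JdbcDialect {
    @Override
    public boolean setCurrentDatabase(Connection conn, String database) {
        // MySQL/StarRocks/Doris: switch the backend connection's database.
        try (java.sql.Statement stmt = conn.createStatement()) {
            stmt.execute("USE `" + database + "`");
            return true;
        } catch (java.sql.SQLException e) {
            return false; // caller decides whether the failure is tolerable
        }
    }
}
```

The boolean return lets the session keep effectiveDatabase only when the switch actually took effect.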

Note: getSchemas() continues to return all accessible schemas per the JDBC spec — the URL-level database only affects the default query context and the blank/"%" fallback in getTables. This matches the behavior of the native MySQL JDBC driver. Clients that want to restrict the visible schema list should do so via client-side filters (e.g. DBeaver's "Schemas" filter).
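The blank/"%" routing in GetTablesOperation can be sketched as a pure function (resolveSchemaPattern is a hypothetical name for illustration, not the Kyuubi method):

```java
// Sketch of the fallback: a blank or "%" schema pattern is routed to the
// session's effective database when one exists; otherwise the pattern is
// passed through unchanged (the previous no-filter behavior).
final class SchemaPatternRouting {
    static String resolveSchemaPattern(String schemaPattern, String effectiveDatabase) {
        boolean blankOrWildcard =
            schemaPattern == null
                || schemaPattern.trim().isEmpty()
                || "%".equals(schemaPattern);
        if (blankOrWildcard && effectiveDatabase != null) {
            return effectiveDatabase; // narrow to the URL-scoped database
        }
        return schemaPattern; // explicit patterns and no-database sessions untouched
    }
}
```

Explicit patterns like "info%" still reach the backend verbatim, which is why non-MySQL dialects see no behavior change.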

How was this patch tested?

Added unit / integration tests in MySQLOperationSuite covering:

  • getTables with a database in the URL returns only that database's tables
  • getTables with a "%" schema pattern is routed to the session's effective database
  • getSchemas returns results with JDBC-standard TABLE_CATALOG / TABLE_SCHEM column labels
  • getCatalogs behavior

Also verified end-to-end against a live StarRocks backend using Spark's Hive-JDBC beeline:

# URL with a database → only that database's tables
jdbc:hive2://127.0.0.1:10009/doctor              → !tables : 3 rows, all TABLE_SCHEMA=doctor
jdbc:hive2://127.0.0.1:10009/information_schema  → !tables : only information_schema views

# URL without a database → previous no-filter behavior preserved
jdbc:hive2://127.0.0.1:10009/                    → !tables : tables from every database

# DatabaseMetaData.getSchemas() returns JDBC-standard columns
columns: TABLE_CATALOG, TABLE_SCHEM
rows:    def/doctor, def/information_schema, def/sys, ...

Was this patch authored or co-authored using generative AI tooling?

Human-authored core design and implementation. Tests, verification, and PR description drafting were assisted by Claude Code.

Generated-by: Claude Code (Opus 4.7)

Closes #7305. Supersedes the stale #7307 per @pan3793's suggestion there.
Co-authored-by: zhzhenqin zhzhenqin@users.noreply.github.com

…ses when a database is specified in the URL

When DBeaver (or any Hive-JDBC client) connects to the Kyuubi JDBC engine with
a database path in the URL, DatabaseMetaData.getTables should return only the
tables from that database. Previously, the Hive JDBC driver converted a null
schemaPattern to "%" before sending the request, which the engine forwarded
straight into a WHERE TABLE_SCHEMA LIKE '%' clause on the backend, returning
tables from every database.

- Introduce JdbcDialect.setCurrentDatabase(connection, database) as an
  extension point; the default is a no-op returning false. MySQLDialect
  overrides it with USE `db` so MySQL/StarRocks/Doris switch the backend
  connection on session open.
- JdbcSessionImpl records the successfully applied database in
  effectiveDatabase and only keeps it when setCurrentDatabase returns true.
  "default" (the Hive JDBC driver's fallback when no database is specified in
  the URL) is tolerated on USE failure to stay compatible with backends that
  have no "default" database.
- JdbcOperationManager.newGetTablesOperation treats a blank schema or "%" as
  "use the session's effective database", routing through
  effectiveDatabase.orNull. Backends without an effective database
  (PostgreSQL, Oracle, etc.) keep the previous no-filter behavior.
- MySQLDialect now implements getCatalogsOperation and getSchemasOperation
  (previously featureNotSupported). getSchemasOperation aliases result columns
  to the JDBC-standard TABLE_CATALOG / TABLE_SCHEM so
  DatabaseMetaData.getSchemas works with clients such as DBeaver.

Closes apache#7305.
@wangzhigang1999 wangzhigang1999 marked this pull request as ready for review April 24, 2026 10:54
} else {
schemaName
}
val query = dialect.getTablesQuery(catalogName, effectiveSchema, tableName, tableTypes)
Member


I don't recall the details of the initial code of this part, but the API looks problematic.

The Hive Thrift GetTablesOperation API is derived from the JDBC API, which is why they look almost identical. For the JDBC engine, such a call should map directly to the JDBC API instead of issuing SQL (if you look at JDBC driver implementations, you may find some are non-SQL-based); the current design complicates the whole thing.

Contributor Author

@wangzhigang1999 wangzhigang1999 Apr 28, 2026


Thanks for raising this — you're right, and I want to confirm I took it seriously rather than hand-wave it away. To make sure I understood, I ran the full set of DatabaseMetaData calls against three different drivers (mysql-connector-j 8.4.0, postgresql 42.7.2, clickhouse-jdbc 0.6.5) using a small Java probe and docker containers. Test fixtures were tailored to each backend (MySQL/PG had PK + FK + a function; ClickHouse used MergeTree which has no native PK/FK).

Cross-driver behavior of each API:

| API | MySQL (databaseTerm=CATALOG) | MySQL (databaseTerm=SCHEMA) | PostgreSQL | ClickHouse |
| --- | --- | --- | --- | --- |
| getCatalogs() | ✓ implemented | ⚠ empty (driver databaseTerm selection) | ✓ implemented | ✓ implemented |
| getSchemas() | ⚠ empty (driver databaseTerm selection) | ✓ implemented | ✓ implemented | ✓ (model has no schema layer) |
| getTableTypes() | ✓ | ✓ | ✓ | ✓ |
| getTables(...) | ✓ | ✓ | ✓ | ✓ |
| getColumns(...) | ✓ | ✓ | ✓ | ✓ |
| getPrimaryKeys(...) | ✓ | ✓ | ✓ | ✓ empty (MergeTree has no PK) |
| getImported / Exported / CrossReference | ✓ | ✓ | ✓ | ✓ empty (no FK in CK) |
| getFunctions(...) | ✓ | ✓ | ✓ | ✓ empty (no fixture created) |
| getTypeInfo() | ✓ | ✓ | ✓ | ✓ |

Things that hold across all three drivers:

  • Every JDBC-standard metadata method is implemented — none threw SQLFeatureNotSupportedException; concepts that don't apply just return empty result sets. (The JDBC contract requires this; clients differ in how strictly they enforce it.)
  • Column names follow the JDBC spec (TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE, KEY_SEQ, PK_NAME, ...) — which is exactly what GetTablesOperation / GetColumnsOperation / etc. on the Hive Thrift side expect, since the Thrift API was derived from the JDBC API in the first place.
  • Driver-specific quirks are minor and well-defined — the databaseTerm selection is unique to mysql-connector-j; PG returns column labels in lowercase while MySQL/CK use uppercase (JDBC spec says case-insensitive, but engine-side string comparisons would need to match).

So the JDBC standard API really does cover every operation the dialect models today, across very different DBMS families. Concretely, this means the JdbcDialect.getXxxQuery family of methods could be deleted in favor of the engine calling connection.getMetaData.getXxx(...) directly — and dialects whose driver implements metadata via a non-SQL path (Avatica, etc.) would automatically be used the way they were designed to.

The few wrinkles I hit are all driver-side and resolvable in a few lines of engine-level adapter code, not by reaching for SQL:

  • getCatalogs / getSchemas are selected by mysql-connector-j's databaseTerm property (only one returns rows in a given mode) — handled by checking the property and routing, or by calling both and taking whichever is non-empty.
  • DATA_TYPE is int in the JDBC spec but the current SQL returns a string — java.sql.Types-to-name mapping is one helper.
  • The current SQL returns extra MySQL-specific columns (ENGINE, TABLE_ROWS, ...) but the Hive Thrift GetTables result schema is fixed to JDBC-standard columns, so those extras are already dropped during serialization today and aren't reaching clients — no real loss.
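For the DATA_TYPE wrinkle above, the one-helper mapping could lean on the standard java.sql.JDBCType enum (a sketch; DataTypeNames is a hypothetical name, not engine code):

```java
import java.sql.JDBCType;

// Map the integer type codes mandated by java.sql.Types to readable names,
// as the "java.sql.Types-to-name mapping is one helper" point suggests.
final class DataTypeNames {
    static String nameOf(int jdbcTypeCode) {
        try {
            return JDBCType.valueOf(jdbcTypeCode).getName();
        } catch (IllegalArgumentException e) {
            return "OTHER"; // vendor-specific code with no JDBCType constant
        }
    }
}
```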

I'd like to keep this PR scoped to the URL-database bug since that's what it set out to fix, and migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test — not something I'd want to fold into a bug fix. Happy to open a follow-up issue tracking the migration if you agree with that split. WDYT?

Member


migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test

if we do that first, do we still need this patch?

Member

@pan3793 pan3793 Apr 28, 2026


No SQLFeatureNotSupportedException anywhere

Actually, the JDBC API Javadocs define the behavior very clearly: a method should not throw SQLFeatureNotSupportedException unless its Javadocs say it can. Some tools, like DBeaver, may tolerate it when a JDBC driver vendor does not respect the API contract, but tools like DataGrip may not.

Member


BTW, we MUST avoid copying the code from mysql-connector-j, it's GPL-licensed ... try to read the docs or use black box testing to tackle the mysql-connector-j-related issue. (this will be a horrible experience ...)

Contributor Author


migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test

if we do that first, do we still need this patch?

Yes — the bug is the Hive JDBC client converting null schemaPattern to "%", and JDBC spec says "%" means "don't narrow by schema", so calling DatabaseMetaData.getTables(...) directly reproduces it. Verified on mysql-connector-j 8.4.0 (connection opened against jdbc:mysql://.../dbA):

conn.getCatalog() == "dbA"
getTables(null, "%",  "%", null) → tables from dbA AND dbB   ← bug reproduces
getTables("dbA", null, "%", null) → tables from dbA only     ← only narrows when explicit

So the engine still has to substitute the URL-scoped database into the call when the client sends null/"%". That routing logic (effectiveDatabase + the if-blank-then-route block in JdbcOperationManager) survives the refactor unchanged — only the line that builds a SQL string flips to conn.getMetaData.getTables(...). The setSchema-on-session-open machinery is also independent.

What the refactor does subsume: the two getCatalogsOperation / getSchemasOperation SQL implementations I added to MySQLDialect — those go away once the engine calls getCatalogs() / getSchemas() directly.

Happy either way on ordering — bug fix first then refactor, or refactor first and a smaller patch on top.

Contributor Author


BTW, we MUST avoid copying the code from mysql-connector-j, it's GPL-licensed ... try to read the docs or use black box testing to tackle the mysql-connector-j-related issue. (this will be a horrible experience ...)

Agreed, and thanks for catching this early. I've rewritten all my prior PR comments and the in-tree comment in MySQLDialect.setSchema to remove any reference / paraphrase of mysql-connector-j source — keeping only the documented databaseTerm property and black-box test observations. Also amended the FOLLOWUP commit message similarly. Will follow this rule going forward.

…tandard setCatalog/setSchema

Replace the hand-rolled `USE \`db\`` with `Connection.setCatalog` plus
`Connection.setSchema` in MySQLDialect.setCurrentDatabase. The MySQL
Connector/J `databaseTerm` connection property determines which of the two
calls actually switches the session database; invoking both ensures the
switch happens regardless of how the user configured the URL, while staying
on standard JDBC API instead of hand-rolled SQL. StarRocksDialect /
DorisDialect inherit this via MySQLDialect.

Also add three StarRocks regression tests covering: scoping when URL
specifies a database, no-filter behavior when URL has no database, and
session-open failure on a non-existent database.
…Dialect

Per review feedback, replicate the JDBC `Connection#setSchema(String)` API on
JdbcDialect with a leading `conn` parameter. The default forwards to the
standard API; dialects whose driver does not support it can override to a
no-op. Drop the boolean return value — callers treat the call as if it always
takes effect. JdbcSessionImpl now records `effectiveDatabase` unconditionally
after the call, decoupling "backend connection switched" from "engine
metadata scope".
…ession open

The Hive JDBC driver sends `USE_DATABASE = "default"` when the user did not
specify a database in the connection URL — a protocol-level stub
indistinguishable from a genuine request for a database literally named
"default". Forwarding that into `setSchema` plus recording it as
`effectiveDatabase` caused PG and ClickHouse metadata operations to scope to
a non-existent or unintended schema, breaking the existing
`get tables` suites in CI.

Filter `"default"` at the session entry point: skip both the dialect
`setSchema` call and the `effectiveDatabase` assignment for it. Drops the
prior try/catch that only existed to tolerate the failure path. Also rewords
the in-tree `MySQLDialect.setSchema` comment to describe the externally
observed behavior of the documented `databaseTerm` connection property,
without paraphrasing driver internals.

URL paths that name a real non-`default` database keep the new scoping
behavior; URL paths that explicitly name `default` continue to behave as
before this PR (no metadata scope), which is the same as URL paths with no
database — there is no observable regression.
@wangzhigang1999
Contributor Author

Where the patch stands now

Posting a wrap-up. Want to flag one decision up-front since it's the least obvious from the diff and the most likely thing to want a sanity-check.

Heads-up: how "default" is handled at session open

The Hive JDBC driver sends USE_DATABASE = "default" as a stub when the user did not specify a database in the URL — there is no protocol-level signal that distinguishes that stub from a genuine request for a database literally named default. After the FOLLOWUP that turned setSchema into the unconditional path (per your review feedback), this caused the PG and ClickHouse get tables suites to fail in CI: their drivers silently accept setSchema("default") (PG sets search_path to a non-existent schema, CH actually has a default database), so the engine then scoped metadata to a non-existent or unintended schema.

The fix is to filter "default" at the session entry point. The trade-off, made explicit:

| URL form | Before this PR | After this PR |
| --- | --- | --- |
| no database in URL | no metadata scope | no metadata scope (unchanged) |
| URL specifies a non-default database | no metadata scope (the original bug) | scoped to that database ✓ |
| URL explicitly names default (and the backend has one) | no metadata scope | no metadata scope (no change) |
No observable regression — the third row was already in this state before this PR (the engine never URL-scoped at all). Users on backends with a real default schema can still target it after connecting via USE default;, which goes through a different code path (newSetCurrentDatabaseOperation). I considered alternatives (try/catch around setSchema, or "verify the schema exists before recording it") but the first is brittle on PG/CH, and the second still does the wrong thing on ClickHouse since CH genuinely has a default database.

The rest of the patch, briefly

Three engine-side layers:

  1. Dialect API: JdbcDialect.setSchema(conn, schema): Unit mirrors Connection#setSchema with a leading conn parameter; the default forwards to the standard API. Only MySQLDialect overrides it, calling both setCatalog and setSchema so the database switch happens regardless of which databaseTerm connection property the user configured.
  2. Session open: invokes the dialect's setSchema for the URL-scoped database (modulo the "default" stub above) and records it as effectiveDatabase.
  3. Operation routing: JdbcOperationManager.newGetTablesOperation substitutes effectiveDatabase when the client sends a blank or "%" schemaPattern, undoing the Hive driver's null → "%" translation.
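The "calls both setCatalog + setSchema" behavior in layer 1 can be exercised without a live backend, e.g. with a dynamic proxy standing in for the Connection. This is an illustrative sketch (SetSchemaSketch and its method names are hypothetical), not the Kyuubi implementation:

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

final class SetSchemaSketch {
    // Invoke both calls: Connector/J's databaseTerm property decides which
    // one actually switches the session database, so issuing both covers
    // either configuration while staying on the standard JDBC API.
    static void setCurrentDatabase(Connection conn, String database) throws Exception {
        conn.setCatalog(database); // effective when databaseTerm=CATALOG
        conn.setSchema(database);  // effective when databaseTerm=SCHEMA
    }

    // A dynamic proxy records which Connection methods were invoked.
    static List<String> recordedCalls() throws Exception {
        List<String> calls = new ArrayList<>();
        Connection fake = (Connection) Proxy.newProxyInstance(
            Connection.class.getClassLoader(),
            new Class<?>[] {Connection.class},
            (proxy, method, args) -> {
                calls.add(method.getName());
                return null; // both recorded methods are void
            });
        setCurrentDatabase(fake, "doctor");
        return calls;
    }
}
```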

Tracked separately

Migrating the dialect API from "return a SQL string" to "engine calls DatabaseMetaData.getXxx(...) directly" — your point on the GetTablesOperation thread, with cross-driver verification confirming feasibility on MySQL / PG / CH. This PR's URL-database routing logic survives that refactor unchanged. Will open a separate issue.

@pan3793
Member

pan3793 commented Apr 29, 2026

Can we do the mapping to DatabaseMetaData.getXxx(...) first? I think this fix will be clearer after that: each Thrift request will exactly match the JDBC DatabaseMetaData API, with no need to transform the parameter value.

@wangzhigang1999
Contributor Author

Happy to do the refactor first if that's the preferred order. One thing to confirm though, since I want to make sure we're aligned: as I noted above, the URL-database routing (effectiveDatabase substitution when the client sends null / "%" schemaPattern) is still required after the refactor — verified empirically on mysql-connector-j 8.4.0, where conn.getMetaData.getTables(null, "%", "%", null) on a connection opened against /dbA still returns tables from all DBs. The refactor cleans up the SQL-string dialect methods but does not subsume this fix.

Given that, do you still prefer refactor-first, or merge this as the minimal bug fix and follow up with the refactor? Either order works for me.

@pan3793
Member

pan3793 commented Apr 29, 2026

the JDBC API Javadocs say

"" retrieves those without a schema; null means that the schema name should not be used to narrow the search

I interpret it to

  • "": filter with TABLE_SCHEM = <current schema>
  • null: retrieve all schemas
  • "%": filter with TABLE_SCHEM LIKE '%', same as null
  • others, actually include "%": filter with TABLE_SCHEM LIKE '<pattern>'

I don't understand why effectiveDatabase substitution is required here?
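The four-case interpretation above can be encoded as a small sketch (buildSchemaFilter and its WHERE-fragment output are illustrative names, not engine code):

```java
// Encode the schemaPattern semantics listed above as a WHERE-fragment
// builder; null output means "do not narrow by schema" (no clause).
final class SchemaFilter {
    static String buildSchemaFilter(String schemaPattern, String currentSchema) {
        if (schemaPattern == null || "%".equals(schemaPattern)) {
            return null; // retrieve all schemas
        }
        if (schemaPattern.isEmpty()) {
            return "TABLE_SCHEM = '" + currentSchema + "'"; // "": current schema
        }
        return "TABLE_SCHEM LIKE '" + schemaPattern + "'"; // explicit pattern
    }
}
```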

@pan3793
Member

pan3793 commented Apr 29, 2026

Note, some JDBC clients (e.g., Cloudera HUE) use * instead of % or null to match all patterns, which is not supported by the JDBC standard, but is supported by many RDBMS, and the implementations are various, either on the JDBC driver side or server side. IIRC, HiveServers and Kyuubi support * for HUE compatibility, but Spark Thrift Server does not.

@wangzhigang1999
Contributor Author

You're right, and I want to walk back my earlier claim that the substitution is required. I tested this end-to-end against master (without this PR) using DBeaver and a real MySQL backend, and traced what actually triggers the bug:

On master, MySQLDialect does not override getCatalogsOperation / getSchemasOperation, so the JdbcDialect defaults take over and throw KyuubiSQLException.featureNotSupported(). The result on the wire (DBeaver metadata refresh):

-- Load catalogs   → throws featureNotSupported
-- Load schemas    → throws featureNotSupported
-- Load tables [null, null, %, null]   ← falls back to "give me everything"
-- Load type info

So the root cause of KYUUBI-7305 is that DBeaver fails to obtain a database list and falls back to getTables(null, null, ...), which the Hive client rewrites to schemaPattern="%" and the engine, per spec, treats as "all schemas". On the same setup with this PR applied, DBeaver gets a real catalog/schema list and switches to issuing scoped calls like getTables(null, "dataeye_starpony", "%", null) on its own — no engine-side substitution needed.

That makes the effectiveDatabase substitution in this PR redundant in practice — your spec interpretation is correct, and the fix users actually need is the getCatalogsOperation / getSchemasOperation implementations plus setSchema at session open. I'll drop the substitution layer and keep the rest, then we can move on to the dialect-API refactor you outlined as the proper follow-up. Apologies for the runaround on this thread.

@pan3793
Member

pan3793 commented Apr 29, 2026

the fix users actually need is the getCatalogsOperation / getSchemasOperation implementations plus setSchema at session open.

yes, this meets my expectations exactly.

@wangzhigang1999
Contributor Author

Going to close this PR — the discussion drifted from the actual root cause, and the substitution mechanism here turned out to be unnecessary as you pointed out. Apologies for the noise on this thread; I should have traced the failure path empirically much earlier instead of arguing from incomplete reasoning, and I appreciate your patience walking me back to it.

I'll open a new PR taking the direction you've been suggesting all along: align the dialect API with JDBC semantics by migrating the hand-written SQL methods (getCatalogsOperation / getSchemasOperation / getTablesQuery / getColumnsQuery / etc.) to call connection.getMetaData.getXxx(...) directly, plus implement setSchema at session open so the URL-scoped database is honored on the connection. That's both more uniform and properly fixes KYUUBI-7305 — DBeaver gets a real catalog/schema list back instead of featureNotSupported, and stops falling back to getTables(null, null, ...).



Successfully merging this pull request may close these issues.

[Bug] DBeaver returned all tables in all databases when connecting.
